AITopics | gold standard

Collaborating Authors

gold standard

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

Neural Information Processing Systems, Mar-20-2026, 20:47:45 GMT

While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community's efforts cannot reach their full potential, thereby slowing the progress and transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking.

artificial intelligence, machine learning, proceedings, (12 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)


Algorithmic Assurance: An Active Approach to Algorithmic Testing using Bayesian Optimisation

Shivapratap Gopakumar, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Venkatesh

Neural Information Processing Systems, Feb-14-2026, 15:02:20 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, assurance, optimization, (15 more...)

Neural Information Processing Systems

Country:

Oceania > Australia (0.14)
North America > Canada > Quebec > Montreal (0.04)
Asia (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)


b5d17ed2b502da15aa727af0d51508d6-AuthorFeedback.pdf

Neural Information Processing Systems, Feb-9-2026, 22:57:49 GMT

annotation, dataset, reliability, (15 more...)

Neural Information Processing Systems

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (0.72)
Information Technology > Artificial Intelligence > Machine Learning (0.72)


Bench4KE: Benchmarking Automated Competency Question Generation

Lippolis, Anna Sofia, Ragagni, Minh Davide, Ciancarini, Paolo, Nuzzolese, Andrea Giovanni, Presutti, Valentina

arXiv.org Artificial Intelligence, Dec-10-2025

The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation. This trend is already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs), natural language questions used by ontology engineers to define the functional requirements of an ontology. However, the evaluation of these tools lacks standardization. This undermines the methodological rigor and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. The presented release focuses on evaluating tools that generate CQs automatically. Bench4KE provides a curated gold standard consisting of CQ datasets from 17 real-world ontology engineering projects and uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of 6 recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.
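The abstract above describes scoring generated CQs against a gold standard with a suite of similarity metrics but does not name them. As a rough illustration only, the sketch below uses token-level Jaccard overlap as a stand-in metric; the function names (`tokens`, `jaccard`, `best_match_scores`) are hypothetical and are not part of Bench4KE.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity in [0, 1] (an assumed stand-in metric)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def best_match_scores(generated, gold):
    """For each generated question, return its best similarity to any gold question."""
    if not gold:
        raise ValueError("gold standard must contain at least one question")
    return [max(jaccard(g, ref) for ref in gold) for g in generated]


if __name__ == "__main__":
    gold = ["Which authors wrote a given paper?", "When was a paper published?"]
    generated = ["Who wrote a given paper?", "Which venue accepted a paper?"]
    for question, score in zip(generated, best_match_scores(generated, gold)):
        print(f"{score:.2f}  {question}")
```

A lexical metric like this misses paraphrases, which is one reason a benchmark would plausibly combine several metrics rather than rely on one.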

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.24554

Country:

Europe > Italy (0.47)
North America > United States (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


Algorithmic Assurance: An Active Approach to Algorithmic Testing using Bayesian Optimisation

Shivapratap Gopakumar, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Venkatesh

Neural Information Processing Systems, Nov-20-2025, 20:06:40 GMT

Using two real-world applications, we demonstrate the efficiency of our methods.

algorithm, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Country:

Oceania > Australia (0.14)
North America > Canada > Quebec > Montreal (0.04)
Asia (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)


Save 30 on This All-Clad Nonstick Frying Pan Set

WIRED, Oct-29-2025, 19:39:14 GMT

Life is too short to use bad nonstick cookware. These All-Clad pans are the gold standard, and they're less expensive than usual. All products featured on WIRED are independently selected by our editors. However, we may receive compensation from retailers and/or from purchases of products through these links. It can be hard to build an Adulting Arsenal.

all-clad nonstick frying pan set, nonstick frying pan set, pan set, (14 more...)

WIRED

Country:

North America > United States > New Mexico (0.05)
North America > United States > California (0.05)
North America > Mexico > Mexico City > Mexico City (0.05)
(2 more...)

Industry:

Transportation (0.52)
Retail (0.36)

Technology:

Information Technology > Communications (0.48)
Information Technology > Artificial Intelligence (0.48)


Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Mealey, Kathleen P., Karr, Jonathan A. Jr., Moreira, Priscila Saboia, Brenner, Paul R., Vardeman, Charles F. II

arXiv.org Artificial Intelligence, Oct-28-2025

Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.nlp.2025.100187

2507.22935

Country:

Europe (0.92)
North America > United States > Massachusetts (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Transportation > Air (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Aerospace & Defense > Aircraft (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?

Thelwall, Mike, Mohammadi, Ehsan

arXiv.org Artificial Intelligence, Oct-28-2025

Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.
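The score-averaging strategy mentioned in the abstract above can be sketched as follows: repeat the same query several times per article, average the returned scores, and correlate the averages with gold-standard scores. This is a minimal illustration with made-up numbers; the function names are hypothetical, and Pearson correlation is an assumption here, since the abstract does not say which correlation the authors report.

```python
from statistics import mean


def average_repeated_scores(scores_per_article):
    """Average the scores from repeated identical queries, one list per article."""
    return [mean(scores) for scores in scores_per_article]


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Illustrative values only: five repeated LLM scores for each of four articles.
    llm_scores = [[3, 2, 3, 3, 2], [4, 4, 3, 4, 4], [1, 2, 2, 1, 2], [3, 4, 3, 3, 3]]
    gold = [3, 4, 2, 3]  # hypothetical gold-standard quality scores
    averaged = average_repeated_scores(llm_scores)
    print("averaged scores:", averaged)
    print("correlation with gold standard:", round(pearson(averaged, gold), 3))
```

Averaging reduces the run-to-run noise of a single query, which is a plausible reason the averaged scores track a gold standard more closely than individual ones.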

correlation, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2510.22389

Country:

Europe > United Kingdom (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Don't Pass$\mathtt{@}k$: A Bayesian Framework for Large Language Model Evaluation

Hariri, Mohsen, Samandar, Amirhossein, Hinczewski, Michael, Chaudhary, Vipin

arXiv.org Machine Learning, Oct-7-2025

Pass$@k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://mohsenhariri.github.io/bayes-kit
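To make the closed-form claim in the abstract above concrete, here is a minimal sketch of a Dirichlet-categorical posterior for a weighted rubric. It is not the authors' code: the function names are hypothetical, and the interval uses a normal approximation (mean ± 1.96 standard deviations), which is an assumption rather than the paper's stated procedure. With posterior parameters a_k = prior_k + count_k and A = Σ a_k, the posterior mean of Σ w_k p_k is Σ w_k a_k / A, and its variance is (Σ w_k² a_k/A − mean²) / (A + 1).

```python
from math import sqrt


def dirichlet_posterior_summary(counts, weights, prior=None):
    """Posterior mean and standard deviation of sum_k w_k * p_k.

    counts  -- observed outcomes per category over the evaluation trials
    weights -- rubric score attached to each category
    prior   -- Dirichlet concentration parameters (uniform, all ones, if omitted)
    """
    if prior is None:
        prior = [1.0] * len(counts)
    if not (len(counts) == len(weights) == len(prior)):
        raise ValueError("counts, weights and prior must have the same length")
    alpha = [a + n for a, n in zip(prior, counts)]  # posterior parameters
    total = sum(alpha)
    probs = [a / total for a in alpha]  # posterior mean of each p_k
    mean = sum(w * p for w, p in zip(weights, probs))
    second_moment = sum(w * w * p for w, p in zip(weights, probs))
    variance = (second_moment - mean * mean) / (total + 1.0)
    return mean, sqrt(max(variance, 0.0))


def approx_interval(mean, sd, z=1.96):
    """Normal-approximation interval (an assumption made for this sketch)."""
    return mean - z * sd, mean + z * sd


if __name__ == "__main__":
    # Binary case: 3 failures and 7 successes, rubric weights 0 and 1.
    mean, sd = dirichlet_posterior_summary([3, 7], [0.0, 1.0])
    low, high = approx_interval(mean, sd)
    print(f"posterior mean {mean:.3f}, approx. interval [{low:.3f}, {high:.3f}]")
```

In the binary case with a uniform prior, the posterior mean reduces to (successes + 1) / (N + 2), which increases with the number of successes for a fixed N; this is consistent with the abstract's statement that it is order-equivalent to average accuracy.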

arxiv preprint arxiv, bayes, evaluation, (13 more...)

arXiv.org Machine Learning

2510.04265

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Ohio > Cuyahoga County > Cleveland (0.04)
South America > Suriname > Marowijne District > Albina (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.82)

